QSAR study of some 5-methyl/trifluoromethoxy-1H-indole-2,3-dione-3-thiosemicarbazone derivatives as anti-tubercular agents

In the present study, quantitative relationships between the molecular structure and the anti-tubercular activity of some 5-methyl/trifluoromethoxy-1H-indole-2,3-dione-3-thiosemicarbazone derivatives were established. The detailed application of an efficient linear method, principal component regression (PCR), to the evaluation of the quantitative structure activity relationships of the studied compounds is demonstrated. Components produced by principal component analysis were used as the input for linear model development. The results indicate a linear relationship between the principal components obtained from the molecular descriptors and the inhibitory activity of this set of molecules. The PCR model explained up to 73% of the variance in the activity of the molecules. The performance of the developed model was tested by several validation methods.


INTRODUCTION
Tuberculosis (TB) is an infectious disease with a high mortality rate all over the world. According to the World Health Organization, one-third of the world's population was infected with Mycobacterium tuberculosis in 2006 (1). Other statistics indicate that an estimated 8 million new cases appear annually, 95% of which occur in developing countries, and almost 2 million of the sufferers are believed to die from the disease (2,3). A large number of infected people carry the latent form of TB, which is a potential reservoir of the disease for future generations. The HIV epidemic has accelerated the TB pandemic and has increased the probability of death from TB. The emergence of multidrug-resistant tuberculosis, resistant to first-line drugs such as isoniazid, rifampicin, ethambutol, streptomycin and pyrazinamide, has made the disease hard to cure (4-6). This obstacle to the treatment of TB, together with the statistics on its prevalence, highlights the need to design novel, more potent compounds that are less prone to resistance and have fewer side effects.
The design of new anti-tubercular drugs may be accomplished using sophisticated computational methods, which are generally based on two different approaches. The first is designing new ligands to inhibit a known biochemical target on the basis of its structural features, which seems the more ideal approach. The second is proposing novel biologically active compounds after statistical analysis of the structures of known drugs with the desired biological activities. The first approach is called structure-based drug design, and the other is referred to as ligand-based drug design (LBDD) (7). Quantitative structure activity relationship (QSAR) modeling is an LBDD method. As one of the most powerful techniques for predicting the bioactivity of various molecules, QSAR starts only with the molecular structure information of previously reported active molecules. Such a data mining approach offers important insight into the relationships between the molecular structure and the biological activity of the investigated compounds by means of statistical models. Various methods, including linear and nonlinear regression methods, have been applied to construct QSAR models (8,9).
Here we investigate the quantitative structure activity relationships of a series of 49 5-methyl/trifluoromethoxy-1H-indole-2,3-dione-3-thiosemicarbazone derivatives reported in the literature as anti-tubercular compounds. Principal component regression (PCR) was employed: principal component analysis served as the feature extraction step, and its output was used as the input of linear regression. The resulting model was then validated using various model validation techniques.

Data preparation
All calculations were performed on a Pentium IV personal computer (CPU at 2.6 GHz) running the Windows XP operating system. All biological data used here were taken from the literature (10). The general structures and the structural details of these compounds are reported in Table 1. pIC50 (log 1/IC50) is the dependent variable that characterizes the biological parameter for the developed QSAR model. The structures of the molecules were drawn and optimized using HyperChem 7.0 software (11). Geometries were optimized with the semi-empirical AM1 method using the Polak-Ribiere algorithm until the root mean square gradient reached 0.01 kcal/mol. The resulting geometries were transferred into the Dragon program, version 2.1 (developed by the Milano Chemometrics and QSAR Group), to calculate the descriptors (12). A large number of different theoretical descriptors were calculated for each molecule. The names and numbers of the calculated descriptors are given in Table 2. These four classes of descriptors were selected to ease the calculations. All the required model development calculations were performed in the MATLAB (version 7.1, MathWorks, Inc.) environment.
In this study, the Kennard and Stone algorithm was employed for assigning the training and test sets (13). This method of data splitting has two advantages. First, the training set molecules map the measured region of the input variable space completely with respect to the induced metric. Second, the test molecules all fall inside the measured region.
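The selection scheme can be illustrated with a minimal Python sketch of the usual max-min formulation of the Kennard and Stone algorithm. It is not the MATLAB code used in the study: the function name `kennard_stone`, the Euclidean metric, and the random matrix standing in for the descriptor data are all assumptions made for illustration.

```python
import numpy as np

def kennard_stone(X, n_train):
    """Split sample indices into (train, test) by the Kennard-Stone max-min rule.

    Assumes n_train >= 2 and Euclidean distance in the space of X.
    """
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all samples.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Start with the two samples that are farthest apart.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_train:
        # Distance from each candidate to its nearest already-selected sample.
        nearest = dist[np.ix_(remaining, selected)].min(axis=1)
        # Add the candidate that is farthest from the selected set.
        pick = remaining[int(np.argmax(nearest))]
        selected.append(pick)
        remaining.remove(pick)
    return selected, remaining

# Synthetic stand-in for the descriptor (or score) matrix of 49 molecules.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(49, 5))
train_idx, test_idx = kennard_stone(X_demo, n_train=40)
```

Because the most extreme samples are picked first, the samples left over for the test set tend to lie inside the region spanned by the training set, which is the property described above.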
Principal component analysis (PCA) was employed to compress a pool of calculated descriptors into principal components (PCs) as new variables.
As a matter of fact, PCR is a multiple linear regression method which uses the score matrix as new variables for building of the model.
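As a minimal sketch of this idea, the following Python code autoscales a descriptor matrix, projects it onto the first few principal components, and regresses the activity on the resulting scores. The function names `fit_pcr` and `predict_pcr`, the use of the first components (rather than a stepwise-selected subset), and the synthetic data are assumptions for illustration; the original calculations were carried out in MATLAB.

```python
import numpy as np

def fit_pcr(X_cal, y_cal, n_components):
    """Fit a principal component regression model on calibration data."""
    X_cal = np.asarray(X_cal, dtype=float)
    y_cal = np.asarray(y_cal, dtype=float)
    mean = X_cal.mean(axis=0)
    std = X_cal.std(axis=0, ddof=1)
    std[std == 0] = 1.0                      # guard against constant columns
    Z = (X_cal - mean) / std                 # autoscaled descriptors
    # Rows of Vt are the principal component loadings, sorted by variance.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    loadings = Vt[:n_components].T
    scores = Z @ loadings                    # score matrix used as new variables
    A = np.column_stack([np.ones(len(scores)), scores])
    b, *_ = np.linalg.lstsq(A, y_cal, rcond=None)   # least-squares coefficients
    return {"mean": mean, "std": std, "loadings": loadings, "b": b}

def predict_pcr(model, X):
    """Predict activities for new samples with a fitted PCR model."""
    Z = (np.asarray(X, dtype=float) - model["mean"]) / model["std"]
    scores = Z @ model["loadings"]
    return model["b"][0] + scores @ model["b"][1:]

# Synthetic example: 40 calibration and 9 test samples with 6 descriptors.
rng = np.random.default_rng(0)
X_cal = rng.normal(size=(40, 6))
y_cal = 5.0 + X_cal[:, 0] - 0.5 * X_cal[:, 1] + 0.1 * rng.normal(size=40)
X_test = rng.normal(size=(9, 6))
model = fit_pcr(X_cal, y_cal, n_components=2)
y_test_pred = predict_pcr(model, X_test)
```

The test set is scaled with the calibration mean and standard deviation, so no information from the test molecules enters the model.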
Once the model has been built, its validation is a crucial part of any QSAR procedure. After the regression coefficients (b) have been calculated by the least squares method, they are used to predict the activity of the external test set. For example, if $X_c$ and $X_t$ are the matrices of factors for the calibration and test sets, respectively, and $Y_c$ is the matrix of observed activities for the calibration set, the predicted activity matrices $\hat{Y}_c$ and $\hat{Y}_t$ for the calibration and test sets can be obtained using the following equations:

$$b = (X_c^{T} X_c)^{-1} X_c^{T} Y_c$$

$$\hat{Y}_c = X_c b$$

$$\hat{Y}_t = X_t b$$

The statistical quality of the generated QSAR models was evaluated using methods such as leave-one-out cross validation and parameters such as the predicted residual sum of squares (PRESS) and the root mean square error (RMSE).
Leave-one-out cross validation is a well-known and accepted method for assessing the reliability of generated QSAR models. In this method, a number of modified data sets (equal to the number of studied molecules) are generated by removing one molecule in each case. For each new data set, a model is generated using the modeling procedure applied in the study. Each model is examined by evaluating its power to predict the bioactivity of the deleted molecule. This process is repeated until a complete set of predicted bioactivities for all of the investigated molecules is obtained. The predictive ability is evaluated by the cross validation coefficient ($R^2_{cv}$), calculated using the following equation:

$$R^2_{cv} = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$$

where $y_i$ and $\hat{y}_i$ are the observed and leave-one-out predicted activities of molecule $i$, $\bar{y}$ is the mean observed activity, and $N$ is the number of molecules. The root mean square error of cross validation (RMSE$_{CV}$) for the developed models is reported as well.

Some criteria for the predictive ability of the model are suggested by Tropsha. If these criteria are satisfied, it can be concluded that the model is predictive (14). These criteria include:

$$R^2_{LOO} > 0.5$$

$$R^2 > 0.6$$

$$\frac{R^2 - R_0^2}{R^2} < 0.1 \quad \text{or} \quad \frac{R^2 - R_0'^2}{R^2} < 0.1$$

$$0.85 \le k \le 1.15 \quad \text{or} \quad 0.85 \le k' \le 1.15$$

where $R^2$ is the correlation coefficient of regression between the predicted and observed activities of compounds in the training and test sets, and the subscript 0 and the slopes $k$ and $k'$ refer to regressions through the origin.
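The leave-one-out procedure can be sketched in Python as follows. This is an illustrative simplification rather than the authors' implementation: the function name `loo_cross_validation` is hypothetical, the predictors `F` are assumed to be factor scores that have already been selected, and $R^2_{cv}$ is computed with the common definition 1 − PRESS / Σ(y_i − ȳ)².

```python
import numpy as np

def loo_cross_validation(F, y):
    """Leave-one-out predictions, PRESS, RMSECV and R2cv for a linear model."""
    F = np.asarray(F, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    y_pred = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                 # drop molecule i
        A = np.column_stack([np.ones(n - 1), F[keep]])
        b, *_ = np.linalg.lstsq(A, y[keep], rcond=None)
        y_pred[i] = b[0] + F[i] @ b[1:]          # predict the left-out molecule
    press = float(np.sum((y - y_pred) ** 2))
    rmse_cv = float(np.sqrt(press / n))
    r2_cv = 1.0 - press / float(np.sum((y - y.mean()) ** 2))
    return y_pred, press, rmse_cv, r2_cv

# Synthetic example with two factors and a small amount of noise.
rng = np.random.default_rng(0)
F_demo = rng.normal(size=(40, 2))
y_demo = 5.0 + 1.2 * F_demo[:, 0] + 0.4 * F_demo[:, 1] + 0.1 * rng.normal(size=40)
y_loo, press, rmse_cv, r2_cv = loo_cross_validation(F_demo, y_demo)
```

A stricter variant would repeat the principal component analysis inside every fold, so that the left-out molecule has no influence on the factors.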

In addition, according to Roy and Roy (15), it is necessary to study the difference between the values of $R^2$ and $R_0^2$, which is expressed through the modified parameter $R_m^2$.

RESULTS
After deleting the zero-variance columns of the X block, PCA was carried out on the pool of all descriptors. As is evident in Table 3, among the generated PCs, only the 9 highest-ranked PCs by eigenvalue were selected for model building. Table 3 shows that PCA gives 9 significant PCs (% variance explained >1), which together explain 95.17% of the variance in the original descriptor data matrix. For each of these PCs, the table lists the eigenvalue, the percent of variance explained, and the cumulative percent of variance. The subsequent studies were therefore restricted to these 9 PCs and to the selection of their best subset for the linear regression. Plotting the first PC against the second showed that none of the compounds used in this study was an outlier, although 5 clusters existed within the data set (Fig. 1).
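The quantities listed in such a table (eigenvalues, percent of variance explained, and cumulative percent) can be computed as in the following Python sketch. The function name `pca_variance_table` and the random data are assumptions; only the 1% threshold is taken from the text.

```python
import numpy as np

def pca_variance_table(X, min_percent=1.0):
    """Eigenvalues, % variance, cumulative % and number of PCs above a threshold."""
    X = np.asarray(X, dtype=float)
    X = X[:, X.var(axis=0) > 0]                  # delete zero-variance columns
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # autoscaling
    s = np.linalg.svd(Z, compute_uv=False)       # singular values, descending
    eigenvalues = s ** 2 / (len(Z) - 1)          # eigenvalues of the correlation matrix
    percent = 100.0 * eigenvalues / eigenvalues.sum()
    cumulative = np.cumsum(percent)
    n_keep = int(np.sum(percent > min_percent))  # PCs explaining more than min_percent
    return eigenvalues, percent, cumulative, n_keep

# Synthetic descriptor pool: 49 molecules, 6 descriptors plus one constant column.
rng = np.random.default_rng(2)
X_pool = np.column_stack([rng.normal(size=(49, 6)), np.ones(49)])
eigenvalues, percent, cumulative, n_keep = pca_variance_table(X_pool)
```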
When the factor scores were used as predictor parameters in a multiple regression equation, using the stepwise selection method of PCR, the following equation was obtained:

$$\mathrm{pIC}_{50} = 14.021\,(\pm 2.143) + 12.210\,(\pm 3.216)\,f_1 + 1.453\,(\pm 0.320)\,f_3$$

$R^2$ = 0.83, S.E. = 0.20, F = 20.05

where F is the F statistic of the ANOVA and S.E. is the standard error of the resulting model.
The above equation shows high equation statistics (81% explained variance in the pIC50 data). Since factor scores were used instead of selected descriptors, and each factor score contains information from different descriptors, loss of information is avoided, and the quality of the PCR equation is better than that of similar equations such as those derived from multiple linear regression (MLR).
The cross-validation method was used to evaluate the robustness of the proposed model. In the leave-one-out cross validation method, one object at a time was eliminated, and PCR was then performed on the remainder of the training set. The activity of the left-out object was predicted using this regression model. This procedure was repeated until each compound in the calibration set had been left out once. The optimum number of factors was selected with respect to the value of RMSECV, the root mean square error of cross validation, and 2 PCs were selected as the optimum number of PCs.
To evaluate the predictive power of the generated PCR model, the optimized model was applied to predict the pIC50 values of all compounds in the calibration and prediction sets. The calculated pIC50 of each molecule and the relative error of prediction by the model are summarized in Table 4. The very small values of the relative errors confirm the accuracy of the proposed PCR model for modeling the anti-tubercular activity of the studied compounds.
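Choosing the number of factors by RMSECV can be sketched as below. This Python example is a simplified illustration, not the procedure used in the study: it repeats the autoscaling and PCA inside every leave-one-out fold and always uses the first k PCs, whereas the reported model was built by stepwise selection among the PCs. The function names `rmsecv_pcr` and `select_n_components` are hypothetical.

```python
import numpy as np

def rmsecv_pcr(X, y, k):
    """Leave-one-out RMSECV of a PCR model built on the first k PCs."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    residuals = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        X_tr, y_tr = X[keep], y[keep]
        mean = X_tr.mean(axis=0)
        std = X_tr.std(axis=0, ddof=1)
        std = np.where(std == 0, 1.0, std)       # guard against constant columns
        Z = (X_tr - mean) / std
        _, _, Vt = np.linalg.svd(Z, full_matrices=False)
        P = Vt[:k].T                             # loadings of the first k PCs
        A = np.column_stack([np.ones(n - 1), Z @ P])
        b, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
        z_out = (X[i] - mean) / std              # scale the left-out sample
        residuals[i] = y[i] - (b[0] + (z_out @ P) @ b[1:])
    return float(np.sqrt(np.mean(residuals ** 2)))

def select_n_components(X, y, max_k):
    """Return the k in 1..max_k with the lowest RMSECV, and all RMSECV values."""
    values = [rmsecv_pcr(X, y, k) for k in range(1, max_k + 1)]
    return int(np.argmin(values)) + 1, values

# Synthetic example in which the activity depends on all three descriptors.
rng = np.random.default_rng(4)
X_demo = rng.normal(size=(25, 3))
y_demo = 1.0 + X_demo @ np.array([1.0, -2.0, 0.5])
best_k, rmsecv_values = select_n_components(X_demo, y_demo, max_k=3)
```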
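For illustration, a relative error of prediction can be computed as in the short Python sketch below. The definition used here, 100 × (predicted − observed) / observed, is an assumption, since the formula is not given in the text, and the numbers are made up rather than taken from Table 4.

```python
import numpy as np

def relative_error_percent(observed, predicted):
    """Relative error of prediction in percent, per compound (assumed definition)."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return 100.0 * (predicted - observed) / observed

# Hypothetical pIC50 values, for illustration only.
rep = relative_error_percent([5.0, 4.0], [5.5, 3.8])
```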

DISCUSSION
To determine the effects of the structural features of the studied derivatives on their anti-tubercular activity, QSAR model development was performed with various calculated molecular descriptors (8). Because of the large number of calculated descriptors, PCA was employed to solve the collinearity problem in the generated descriptors, and the PCs were used as new variables for model building. In the PC analysis, a data preprocessing step (autoscaling) must first be performed on the descriptors calculated by Dragon. Suppose $X_{i,j}$ is the column mean-centered and scaled matrix of descriptors for i samples and j descriptors, and $y_{i,1}$ is the matrix of the activity (pIC50). After the principal components have been generated from the matrix $X_{i,j}$, a new matrix containing the scores of the PCs is created. These scores are then used as new variables for regression. Scores as new variables possess two interesting properties (8): (i) They are sorted so that the information content (variance explained) decreases from the first PC to the last. As a result, the last PCs can be deleted, since they carry little useful information. (ii) PCs are orthogonal, so the correlation problem that exists in the pool of descriptors calculated in this study is solved.
After the calculation of the PCs, these factors are used as new variables in the building of the model. In order to evaluate the final developed model, the existing data set was divided into training and external prediction (test) sets. Almost 20% of the molecules (9 out of 49) were selected as external test set molecules. The training set plays an important role in determining the properties of the model.
Ideally, the data set is split so that the training set and the test set each cover the total space occupied by the original data set. Selecting very similar molecules as the training set increases the possibility of overfitting the developed model. Hence, an ideal split of the data set is one in which each object in the test set is close to at least one of the objects in the training set. One of the best methods for data splitting is the Kennard and Stone algorithm. After the molecules had been divided into training and test sets with the Kennard and Stone algorithm, the regression models were built using the calibration set.
In order to obtain a linear relationship with the independent variables, the logarithms of the inverse of the biological activity (log 1/IC50) of the 49 molecules were used.
A close consideration of the different statistical parameters indicated that the developed QSAR model could explain and predict 73% and 76% of the variance in pIC50 in the training and test set data, respectively. The plot of the data obtained by PCR shows low scatter, with no systematic error (Fig. 2). As shown in Table 5, $R^2$, which indicates the goodness of fit of the proposed model, was obtained for three sets, and the high value of this parameter indicates a good fit between the PCs and the anti-tubercular activities of the investigated compounds predicted by the developed PCR model. This shows the high predictability of the proposed model.
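Both properties of the scores can be checked numerically. The following Python sketch builds a deliberately collinear synthetic descriptor matrix (an assumption for illustration, not the Dragon descriptors), autoscales it, and compares the correlations among the descriptors with the covariances among the PC scores.

```python
import numpy as np

rng = np.random.default_rng(1)
base = rng.normal(size=(49, 3))
# Six descriptors, three of which are near-copies of the other three (collinear).
X = np.column_stack([base, base + 0.01 * rng.normal(size=(49, 3))])

# Autoscaling: mean-center each column and scale it to unit variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Scores of the principal components (projection onto the loadings).
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt.T

descriptor_corr = np.corrcoef(Z, rowvar=False)
score_cov = np.cov(scores, rowvar=False)

# Largest off-diagonal correlation among descriptors (close to 1 here).
max_descriptor_corr = np.abs(descriptor_corr - np.eye(6)).max()
# Largest off-diagonal covariance among scores (zero up to rounding).
max_score_offdiag = np.abs(score_cov - np.diag(np.diag(score_cov))).max()
# Variance carried by each PC, sorted from largest to smallest.
score_variances = np.diag(score_cov)
```

In this example the last three score variances are close to zero, which is why trailing PCs can be discarded with little loss of information.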
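The transformation itself is a one-line calculation, sketched here in Python. The function name `pic50_from_ic50` and the unit handling are assumptions: the text does not state the concentration units of the IC50 values, so the sketch converts to molar before taking the logarithm.

```python
import math

def pic50_from_ic50(ic50, unit="M"):
    """Return pIC50 = log10(1 / IC50), with IC50 converted to molar first."""
    factors = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9}
    ic50_molar = ic50 * factors[unit]
    if ic50_molar <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_molar)

# A hypothetical IC50 of 10 micromolar corresponds to a pIC50 of about 5.
example_pic50 = pic50_from_ic50(10.0, unit="uM")
```

More potent compounds have smaller IC50 values and therefore larger pIC50 values, which gives the dependent variable a more convenient scale for linear modeling.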

CONCLUSION
Quantitative relationships between the molecular structure and the inhibitory activity of a series of 5-methyl/trifluoromethoxy-1H-indole-2,3-dione-3-thiosemicarbazone derivatives as anti-tubercular agents were established using a collection of calculated descriptors, including topological, geometrical, constitutional, and functional group descriptors. It was found that a properly chosen and designed PCR model could represent the dependence of the anti-tubercular activity of the 5-methyl/trifluoromethoxy-1H-indole-2,3-dione-3-thiosemicarbazone derivatives on the PCs extracted from the various geometrical, topological, and other calculated descriptors. The optimized principal component regression method could reproduce the linear relationship between the pIC50 values and the PCs.

If the $R_m^2$ value for the given model is >0.5, it indicates good external predictability of the developed model.

Fig. 1. The first two components (PC1 and PC2) from the principal component analysis of the 49 considered molecules.

Fig. 2. Plot of calculated vs. experimental activity of the investigated compounds in the training and test sets.

The external predictability of a proposed model is generally tested using test sets and $R^2_{CV}$. The satisfactory prediction of the inhibitory activity of the test set compounds demonstrates the efficacy of the QSAR model in predicting the activities of external molecules. Moreover, the low values of RMSE and PRESS for the training and test sets add to the statistical significance of the developed model. In addition, the model was assessed on the basis of the criteria recommended by Tropsha and also $R_m^2$.

N: number of objects in the data set; $R^2$: correlation coefficient of experimental and predicted activities; RMSE: root mean square error; PRESS: predicted error sum of squares; $R^2_{cv}$: coefficient of leave-one-out cross validation; RMSEcv: root mean square error of cross validation.

Table 1. (continued)
Columns: Compound, R1, R2, R3, Type.
a Molecules assigned by the Kennard and Stone algorithm as the test set.

Table 2. Some descriptors used in model building.

Table 3. Eigenvalues of calculated PCs, % of explained variances and cumulative variances.

$R_0^2$ is the correlation coefficient for the regression between observed versus predicted activities through the origin ($R_0'^2$ for predicted versus observed), and the slopes of the regression lines through the origin are denoted by k and k', respectively. Details of the definitions of these parameters can be found in the literature (14).

Table 4. Calculated activities by PCR and their relative error of prediction (REP).

Table 5. Statistical parameters obtained for the developed model for the anti-tubercular activity of the investigated compounds.